5.13.1 Data adaptation for survival analysis
Survival analyses require the following:
- A defined measurement period
- A clear definition of the event for which the probability will be estimated
- A prepared dataset that must contain the following variables:
  - Time
  - Event
The "time" variable must contain a measure of the time that has passed from a given start time to the specific event occurring. You can freely choose the measurement unit, e.g., days, weeks, months, or years. The only requirement is that "time" must be a numerical variable.
The "event" variable must also be numerical. It takes the value 1 for individuals for whom the event actually occurred during the measurement period, and 0 for individuals for whom the event was not observed during that period. The latter are called "censored" cases: it is not possible to know whether the event has occurred, either because it may have occurred after the measurement period ended or because the individual left the population during the measurement period. The value 0 does not have to be present in the data; there may well be cases where "event" has the value 1 for all units (individuals).
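The relationship between "time", "event", and censoring can be sketched outside microdata.no in a few lines of Python. This is only an illustration under stated assumptions: the function name `time_and_event` and its parameters are hypothetical, time is measured in days, and an event date, when present, is assumed not to precede the start of the measurement period.

```python
from datetime import date
from typing import Optional, Tuple


def time_and_event(start: date, end: date,
                   event_date: Optional[date]) -> Tuple[int, int]:
    """Return (days, event) for one individual (hypothetical helper).

    start and end delimit the measurement period; event_date is None
    when no event has been registered for the individual.
    """
    if event_date is not None and event_date < start:
        # Assumption of this sketch: events before the period are not handled.
        raise ValueError("event_date precedes the measurement period")
    if event_date is not None and event_date <= end:
        # Event observed within the period: time runs up to the event.
        return (event_date - start).days, 1
    # Censored case: no event observed, time is capped at the period length.
    return (end - start).days, 0
```

An individual with an event ten days into the period gets time 10 and event 1, while an individual without a registered event gets the full period length and event 0.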
Time and event can be calculated in the following ways:
- Use the import command import-event, which lets you define the event variable and the measurement period and adds start dates for all events to your dataset. Then use the aggregation command collapse(min) on the start date variable to find the time of the first event for a specific value of the variable you import through import-event.
- Use ready-made date variables with fixed values per unit.
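The idea behind aggregating event start dates with a minimum, so that each person is left with the date of their first event, can be illustrated in plain Python. This is a sketch of the concept only, not of how microdata.no implements collapse; the function name `first_event_per_person` and the (person id, start date) record layout are assumptions made for the illustration.

```python
from datetime import date
from typing import Dict, Iterable, Tuple


def first_event_per_person(
        events: Iterable[Tuple[str, date]]) -> Dict[str, date]:
    """Reduce event-level records to one row per person (hypothetical helper).

    Each record is (person_id, start_date); the result maps each person
    to the earliest start date found among that person's records.
    """
    first: Dict[str, date] = {}
    for person_id, start in events:
        # Keep the smallest (earliest) start date seen so far for this person.
        if person_id not in first or start < first[person_id]:
            first[person_id] = start
    return first
```

Persons with no matching events do not appear in the result at all, which is why a separate step is needed to assign them a censored time value.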
Example demonstrating data adaptation through the commands import-event and collapse(min):
require no.ssb.fdb:23 as ds
textblock
Create dataset with relevant event based variable and define measurement period
endblock
create-dataset unemployed
import-event ds/ARBSOEK2001FDT_HOVED 2010-01-01 to 2019-12-15 as unempl_status
textblock
Keep all events where unemployment status = unemployed and the start date is after 2010-01-01
endblock
keep if unempl_status == '1' & START@unempl_status > date(2010,01,01)
textblock
Find first occurrence of event and aggregate to person level
endblock
collapse (min) START@unempl_status, by(PERSONID_1)
//Create a small random selection (optional)
sample 10000 3245
textblock
Calculate the number of days from the start of measurement period until the first time an event occurs
endblock
generate days = START@unempl_status - date(2010,01,01)
replace days = 0 if days < 0
summarize days
histogram days
textblock
Create the event variable, which will have the value 1 for all persons with a value for the number of days.
Those with a missing value, or who have an event date after the end of the measurement period, will get the value 0 (also called "censored cases").
endblock
generate event = 1 if sysmiss(days) == 0
replace event = 0 if sysmiss(days) | START@unempl_status > date(2019,12,15)
textblock
Set the number of days to maximum value for persons who have not had an event occurrence during the measurement period. These are people who have survived the whole period without any event occurring. These will also get the event value = 0 through the prior step.
endblock
replace days = date(2019,12,15) - date(2010,01,01) if sysmiss(days)
tabulate event, summarize(days) mean freq
textblock
Create a year variable as an alternative to days
endblock
generate years = int(days/365.24)
tabulate years, missing
histogram years, discrete
summarize years event
histogram days
textblock
Import and adapt various explanatory variables in order to compare survival rates across population groups
endblock
import ds/BEFOLKNING_KJOENN as gender
import ds/BEFOLKNING_INVKAT as imm_cat
import ds/BEFOLKNING_FOEDSELS_AAR_MND as birthdate
generate age2010 = 2010 - int(birthdate/100)
generate agegroup = 1
replace agegroup = 2 if age2010 > 30
replace agegroup = 3 if age2010 > 50
define-labels agelabel 1 "Age 0-30" 2 "Age 31-50" 3 "Age 51 ->"
assign-labels agegroup agelabel
generate norwegian = 0
replace norwegian = 1 if imm_cat == 'A'
summarize norwegian
tabulate event norwegian
tabulate years norwegian
define-labels norlabel 0 "Foreign origin" 1 "Norwegian origin"
assign-labels norwegian norlabel
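The two recodings used in the script above, whole years from days and a three-level age group, follow simple threshold rules that can be checked in isolation. The Python sketch below mirrors those rules for illustration only; the function names `whole_years` and `age_group` are hypothetical, and the divisor 365.24 and the cut-offs 30 and 50 are taken directly from the script.

```python
def whole_years(days: int) -> int:
    """Convert a day count to completed years, truncating the fraction."""
    return int(days / 365.24)


def age_group(age: int) -> int:
    """Return 1 for ages up to 30, 2 for 31-50 and 3 for 51 and above."""
    if age > 50:
        return 3
    if age > 30:
        return 2
    return 1
```

Because the year value is truncated, a person censored at the end of the 2010-01-01 to 2019-12-15 period (3635 days) ends up in year 9 rather than year 10.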
Example demonstrating data adaptation by using fixed date variables:
require no.ssb.fdb:23 as ds
textblock
Create dataset consisting of people older than 70 who are resident in Norway as of 2010-01-01
endblock
create-dataset elder
import ds/BEFOLKNING_FOEDSELS_AAR_MND as birthdate
import ds/BEFOLKNING_STATUSKODE 2010-01-01 as regstat
generate age = 2010 - int(birthdate/100)
keep if age > 70 & regstat == '1'
textblock
Import a ready-made date variable (fixed information): date of death
Transform the dates into UnixTime format by using the date() function
endblock
import ds/BEFOLKNING_DOEDS_DATO as deathdate
summarize deathdate
replace deathdate = string(deathdate)
generate yyyy = substr(deathdate,1,4)
generate mm = substr(deathdate,5,2)
generate dd = substr(deathdate,7,2)
destring yyyy
destring mm
destring dd
generate deathdate2 = date(yyyy,mm,dd)
summarize deathdate2
textblock
Calculate the number of days from 2010-01-01 to death date
endblock
generate days = deathdate2 - date(2010,01,01)
replace days = 0 if days < 0
textblock
Set the event value to 1 if the death date falls on or after 2010-01-01 and no later than 2023-01-01, which is used as the last measurement date. All others are given the event value 0
endblock
generate event = 0
replace event = 1 if sysmiss(deathdate) == 0 & deathdate2 >= date(2010,01,01) & deathdate2 <= date(2023,01,01)
textblock
Set number of days to max value if no death date or death occurs after the last measurement date
endblock
replace days = date(2023,01,01) - date(2010,01,01) if sysmiss(days) | deathdate2 > date(2023,01,01)
tabulate event, summarize(days) mean freq
//Create a year version of time
generate year = int(days/365.24)
tabulate year
//Import gender to compare survival rates across genders
import ds/BEFOLKNING_KJOENN as gender
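The date handling in the script above splits an eight-digit value into year, month, and day before building a date. The same step can be sketched in Python for illustration; the function name `parse_yyyymmdd` is hypothetical, and the sketch assumes that the value is stored as a YYYYMMDD number, as the substring positions in the script suggest, and that a missing date is represented as None.

```python
from datetime import date
from typing import Optional


def parse_yyyymmdd(value: Optional[int]) -> Optional[date]:
    """Turn a number such as 20150309 into a date (hypothetical helper).

    Returns None when no date is registered.
    """
    if value is None:
        return None
    text = str(value)
    # Characters 1-4 hold the year, 5-6 the month and 7-8 the day.
    return date(int(text[0:4]), int(text[4:6]), int(text[6:8]))
```

Once the value is a proper date, subtracting the start of the measurement period gives the number of days directly, for example `(parse_yyyymmdd(20150309) - date(2010, 1, 1)).days`.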